Informedia™: News-on-Demand Experiments in Speech Recognition

Authors

  • Howard D. Wactlar
  • Alexander G. Hauptmann
  • Michael J. Witbrock
Abstract

In theory, speech recognition technology can make any spoken words in video or audio media usable for text indexing, search and retrieval. This article describes the News-on-Demand application created within the Informedia™ Digital Video Library project and discusses how speech recognition is used in transcript creation from video, alignment with closed-captioned transcripts, audio paragraph segmentation and a spoken query interface. Speech recognition accuracy varies dramatically depending on the quality and type of data used. Informal information retrieval tests show that reasonable recall and precision can be obtained with only moderate speech recognition accuracy.

1. INFORMEDIA: NEWS-ON-DEMAND

The Informedia™ digital video library project [1,2,3,4] at Carnegie Mellon University is creating a digital library of text, images, videos and audio data available for full content search and retrieval. News-on-Demand is an application within Informedia that monitors news from TV, radio and text sources and allows the user to retrieve news stories of interest.

A compelling application of the Informedia project is the indexing and retrieval of television, radio and text news. The Informedia: News-on-Demand application [5,6] is an innovative example of indexing and searching broadcast news video and news radio material by text content. News-on-Demand is a fully automatic system that monitors TV, radio and text news and allows selective retrieval of news stories based on spoken queries. The user may choose among the retrieved stories and play back news stories of interest. The system runs on a Pentium PC using MPEG-I video compression. Speech recognition is done on a separate platform using the Sphinx-II continuous speech recognition system [7].

The News-on-Demand application forces us to consider the limits of what can be done automatically and in limited time.
News events happen daily, and it is not feasible to process, segment and label news through manual or “human-assisted” methods. Timeliness of the library information is important, as is the ability to continuously update the contents. Thus we are forced to fully exploit the potential of computer speech recognition without the benefit of human corrections and editing.

Howard D. Wactlar, Alexander G. Hauptmann and Michael J. Witbrock, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890

Even though our work is centered around processing news stories from TV broadcasts, the system exemplifies an approach that can make any video, audio or text data accessible. Similar methods can help to index and search other streamed multimedia data by content in other Informedia applications.

Other attempts at solutions have been obliged to restrict the data to only text material, as found in most news databases. Video-on-demand allows a user to select (and pay for) a complete program, but does not allow selective retrieval. The closest approximation to News-on-Demand can be found in the “CNN-AT-WORK” system offered to businesses by a CNN/Intel venture. At the heart of the CNN-AT-WORK solution is a digitizer that encodes the video into the INDEO compression format and transmits it to workstations over a local area network. Users can store headlines together with video clips and retrieve them at a later date. However, this service depends entirely on the separately transmitted “headlines” and does not include other news sources. In addition, CNN-AT-WORK does not feature an integrated multimodal query interface [8].

Preliminary investigations on the use of speech recognition to analyze a news story were made by [9]. Without a powerful speech recognizer, their approach used a phonetic engine that transformed the spoken text into an (errorful) phoneme string.
The query was also transformed into a phoneme string and the database searched for the best approximate match. Errors in recognition, as well as word prefix and suffix differences, did not severely affect the system since they scattered equally over all documents and well-matching search scores dominated the retrieval.

Another news processing system that includes video materials is the MEDUSA system [10]. The MEDUSA news broadcast application can digitize and record news video and teletext, which is equivalent to closed-captions. Instead of segmenting the news into stories, the system uses overlapping windows of adjacent text lines for indexing and retrieval. During retrieval the system responds to typed requests, returning an ordered list of the most relevant news broadcasts. Query words are stripped of suffixes before search, and the relevance ranking takes word frequency in the segment and over the whole corpus into account, as well as the ability of words to discriminate between stories. Within a news broadcast, it is up to the user to select and play a region using information given by the system about the location of the matched keywords. The focus of MEDUSA is on the system architecture and the information retrieval component. No image processing and no speech recognition is performed.

This material is based upon work supported by the National Science Foundation under Cooperative Agreement No. IRI-9411299. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

1.1. Component Technologies

There are three broad categories of technologies we can bring to bear to create and search a digital video library from broadcast video and audio materials [11]:

Text processing looks at the textual (ASCII) representation of the words that were spoken, as well as other text annotations.
These may be derived from the transcript, from the production notes or from the closed-captioning that might be available. Text analysis can work on an existing transcript to help segment the text into paragraphs [12]. An analysis of keyword prominence allows us to identify important sections in the transcript [13]. Other more sophisticated language-based criteria are under investigation. We currently use two main techniques for text analysis:

1. If we have a complete time-aligned transcript available from the closed-captioning or through a human-generated transcription, we can exploit natural “structural” text markers such as punctuation to identify news story boundaries.
2. To identify and rank the contents of one news segment, we use the well-known technique of TF/IDF (term frequency/inverse document frequency) to identify critical keywords and their relative importance for the video document [13].

Image analysis looks at the images in the video portion of the MPEG stream. This analysis is primarily used for the identification of scene breaks and to select static frame icons that are representative of a scene. Image statistics are computed for

[Figure labels: Library Creation, Library Exploration, Offline, Online, TV News, Radio News, Text News]
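
The TF/IDF keyword ranking named as the second text-analysis technique above can be illustrated with a minimal sketch. It assumes count-based term frequency and a log(N/df) inverse document frequency; the excerpt does not state the exact weighting used in News-on-Demand, and the tokenizer, the function names and the sample segments are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (an illustrative tokenizer, not the paper's)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_keywords(segments, top_n=3):
    """Rank the words of each news segment by TF * IDF.

    TF is the word's count within the segment. IDF is log(N / df), where N is
    the number of segments and df the number of segments containing the word.
    This weighting is an assumption made for the sketch.
    """
    tokenized = [tokenize(s) for s in segments]
    n_segments = len(tokenized)

    # Document frequency: in how many segments does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    ranked = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {w: c * math.log(n_segments / df[w]) for w, c in tf.items()}
        # Highest score first; ties broken alphabetically for a stable order.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        ranked.append(top)
    return ranked


# Hypothetical story segments standing in for recognized or captioned text.
segments = [
    "the senate passed the budget bill and the budget goes to the president",
    "the storm hit the coast and the storm closed the airport",
    "the team won the final and the coach praised the team",
]
for keywords in tfidf_keywords(segments, top_n=2):
    print(keywords)
```

Words that occur in every segment receive an IDF of zero and drop out of the ranking, while words concentrated in one segment rise to the top. This is the property that makes such a score usable for picking out a story's critical keywords.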


Similar resources

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...


Informedia: News-on-Demand Multimedia Information Acquisition

In theory, speech recognition technology can make any spoken words in video or audio media subject to text indexing, search and retrieval. This article describes the News-on-Demand application created within the Informedia™ Digital Video Library project and discusses how speech recognition is used for transcript creation from video, time alignment of closed-captioned transcripts, a speech quer...


Speech Emotion Recognition Based on Power Normalized Cepstral Coefficients in Noisy Conditions

Automatic recognition of emotional states from speech in noisy conditions has become an important research topic in emotional speech recognition in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its perfor...


Informedia News-on-Demand: Using Speech Recognition to Create a Digital Video Library

In theory, speech recognition technology can make any spoken words in video or audio media usable for text indexing, search and retrieval. This article describes the News-on-Demand application created within the Informedia™ Digital Video Library project and discusses how speech recognition is used in transcript creation from video, alignment with closed-captioned transcripts, audio paragraph s...


Improving of Feature Selection in Speech Emotion Recognition Based-on Hybrid Evolutionary Algorithms

One of the important issues in speech emotion recognition is selecting appropriate feature sets in order to improve the detection rate and classification accuracy. In previous studies, researchers tried to select the appropriate features for classification by using feature selection and feature-space reduction methods, such as Fisher and PCA. In this research, a hybrid evolutionary algorit...



Publication date: 1998